@*
* Copyright 2016 LinkedIn Corp.
*
* Licensed under the Apache License, Version 2.0 (the "License"); you may not
* use this file except in compliance with the License. You may obtain a copy of
* the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
* WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
* License for the specific language governing permissions and limitations under
* the License.
*@

<p>
    This analysis shows how well the number of mappers is adjusted.<br>
    This should allow you to better tweak the number of mappers for your job.<br>
    There are two possible situations that needs some tuning.
</p>
<h3>Mapper time too short</h3>
<p>
    This usually happens when the Hadoop job has:
<ul>
    <li>A <b>large</b> number of mappers</li>
    <li><b>Short</b> mapper avg runtime</li>
    <li><b>Small</b> file size</li>
</ul>
</p>
<h5>Example</h5>
<p>
<div class="list-group">
    <a class="list-group-item list-group-item-danger" href="#">
        <h4 class="list-group-item-heading">Mapper Input Size</h4>
        <table class="list-group-item-text table table-condensed left-table">
            <thead><tr><th colspan="2">Severity: Critical</th></tr></thead>
            <tbody>
            <tr>
                <td>Number of tasks</td>
                <td>1516</td>
            </tr>
            <tr>
                <td>Average task input size</td>
                <td>19 KB</td>
            </tr>
            <tr>
                <td>Average task runtime</td>
                <td>1min 32s</td>
            </tr>
            </tbody>
        </table>
    </a>
</div>
</p>
<h4>Suggestions</h4>
<p>
    You should tune mapper split size to reduce number of mappers and let each mapper process larger data <br>
    The parameters for changing split size are: <br>
    <ul>
    <li>mapreduce.input.fileinputformat.split.minsize/maxsize</li>
    <li>pig.maxCombinedSplitSize (Pig Only)</li>
    </ul>
    Examples on how to set them:
    <ul>
    <li>HadoopJava: conf.setLong("mapreduce.input.fileinputformat.split.minsize", XXXXX)</li>
    <li>Pig: set mapreduce.input.fileinputformat.split.minsize XXXXX </li>
    <li>Hive: set mapreduce.input.fileinputformat.split.minsize=XXXXX</li>
    </ul>

    The split size is controlled by formula <b>max(minSplitSize, min(maxSplitSize, blockSize))</b>. By default,
    blockSize=512MB and minSplit < blockSize < maxSplit. <br>
    You should always refer to this formula.<br>
    In the case above, try <b>increasing min split size</b> and let each mapper process larger data.<br>
    <strong>[Note]</strong> By default HadoopJava will not combine small files, so each mapper cannot process more than
    one file, and changing split size won't help. If that is your case, you should either try CombineFileInputFormat or
    use Pig/Hive.
   <br>
    See <a href="https://github.com/linkedin/dr-elephant/wiki/Tuning-Tips">Hadoop Tuning Tips</a> for further information.<br>
</p>
<h3>Large files/Unsplittable files</h3>
<p>
    This usually happens when the Hadoop job has:
<ul>
    <li>A <b>small</b> number of mappers</li>
    <li><b>Long</b> mapper avg runtime</li>
    <li><b>Large</b> file size (a few GB's)</li>
</ul>
</p>
<h5>Example</h5>
<p>
<div class="list-group">
    <a class="list-group-item list-group-item-danger" href="#">
        <h4 class="list-group-item-heading">Mapper Input Size</h4>
        <table class="list-group-item-text table table-condensed left-table">
            <thead><tr><th colspan="2">Severity: Critical</th></tr></thead>
            <tbody>
            <tr>
                <td>Number of tasks</td>
                <td>50</td>
            </tr>
            <tr>
                <td>Average task input</td>
                <td>3 GB</td>
            </tr>
            <tr>
                <td>Average task runtime</td>
                <td>52 min 30s</td>
            </tr>
            </tbody>
        </table>
    </a>
</div>
</p>
<h4>Suggestions</h4>
<p>
    The split size is too large. You should tune mapper split size to increase number of mappers and let each mapper
    process less data.<br>
    <br>
    The input split size is controlled by formula <b>max(minSplitSize, min(maxSplitSize, blockSize))</b>. See the
    previous section for further details. <br>
    In the case above, since mapper input size >> block size and you want to increase mappers, you should decrease min split size close to BlockSize(512MB). <br>
    See <a href="https://github.com/linkedin/dr-elephant/wiki/Tuning-Tips">Hadoop Tuning Tips</a> for further information.
</p>
